[Obtain the dataset at this address] (https://www.kaggle.com/datasets/everydaycodings/produce-prices-dataset)
Let’s do some exploratory data analysis and data cleanup to start things off.
produce <- read.csv("ProductPriceIndex.csv")
print(summary(produce))
productname date farmprice atlantaretail chicagoretail losangelesretail
Length:15766 Length:15766 Length:15766 Length:15766 Length:15766 Length:15766
Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
newyorkretail averagespread
Length:15766 Length:15766
Class :character Class :character
Mode :character Mode :character
print(head(produce))
Notice that all the features are characters. Some of them, like “farmprice,” would make more sense as floats; additionally, the features that represent prices have a pesky dollar sign in front. So let’s fix that first.
produce$farmprice <- as.numeric(gsub("\\$", "", produce$farmprice))
produce$atlantaretail <- as.numeric(gsub("\\$", "", produce$atlantaretail))
produce$chicagoretail <- as.numeric(gsub("\\$", "", produce$chicagoretail))
produce$losangelesretail <- as.numeric(gsub("\\$", "", produce$losangelesretail))
produce$newyorkretail <- as.numeric(gsub("\\$", "", produce$newyorkretail))
produce$averagespread <- as.numeric(gsub("\\%", "", produce$averagespread))
Warning: NAs introduced by coercion
print(head(produce))
Awesome. Another cleanup we’ll perform is adjusting the “date” feature to date-time format, which will make it easier to perform computations on.
produce$date <- as.POSIXct(produce$date, format = "%Y-%m-%d")
print(head(produce))
Perfect. Finally, let’s remove all the null values, if any.
produce <- na.omit(produce)
Now we’re done with our initial data cleanup. We model the variable “averagespread” based on one quantitative predictor: “farmprice”.
library(ggplot2)
ggplot(produce, aes(x = farmprice, y = averagespread)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Spread",
x = "Farm Price",
y = "Average Spread") +
theme_minimal() +
geom_point(size = 0.1)
This is a pretty strong graph, but let’s check out the other variables before building a regression model. We perform feature engineering by taking the average retail price, then we model “averagespread” based on the new feature.
produce$avgretail <- rowMeans(produce[, c("atlantaretail",
"chicagoretail",
"losangelesretail",
"newyorkretail")],
na.rm = TRUE)
ggplot(produce, aes(x = farmprice, y = avgretail)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Retail",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
This doesn’t look very promising, so let’s make the same model but we’ll isolate specific products.
#Scatterplot for strawberries
produce_strawberries <- subset(produce, productname == "Strawberries")
ggplot(produce_strawberries, aes(x = farmprice, y = avgretail)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Retail Price (Strawberries)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
#Scatterplot for potatoes
produce_potatoes <- subset(produce, productname == "Potatoes")
ggplot(produce_potatoes, aes(x = farmprice, y = avgretail)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Retail Price (Potatoes)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
#Scatterplot for oranges
produce_oranges <- subset(produce, productname == "Oranges")
ggplot(produce_oranges, aes(x = farmprice, y = avgretail)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Retail Price (Oranges)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
While we can see some trends, it might be more worth our while to model “farmprice vs averagespread” for each product.
#Scatterplot for strawberries
ggplot(produce_strawberries, aes(x = farmprice, y = averagespread)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Spread (Strawberries)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
#Scatterplot for potatoes
ggplot(produce_potatoes, aes(x = farmprice, y = averagespread)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Spread (Potatoes)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
#Scatterplot for oranges
ggplot(produce_oranges, aes(x = farmprice, y = averagespread)) +
geom_point() +
labs(title = "Scatterplot of Farm Price vs. Average Spread (Oranges)",
x = "Farm Price",
y = "Average Retail Price") +
theme_minimal() +
geom_point(size = 0.1)
These plots demonstrate that “farmprice vs averagespread” has a distinct trend resembling the graph 1/x. Let’s compare 2 models: a linear regression model and a hierarchical model, starting with the former. We’ll perform one more data cleanup: we’ll remove all the rows where “farmprice” = 0 because our model will encounter an error for those values, since 1/0 is undefined.
produce <- produce[produce$farmprice != 0, ]
# install.packages("bayesrules")
library(bayesrules)
# install.packages("tidyverse")
library(tidyverse)
# install.packages("bayesplot")
library(bayesplot)
# install.packages("tidybayes")
library(tidybayes)
# install.packages("rstanarm")
library(rstanarm)
# install.packages("broom.mixed")
library(broom.mixed)
# install.packages("gridExtra")
library(gridExtra)
library(dplyr)
produce_model <- stan_glm(
averagespread ~ I(1/farmprice) + productname,
data = produce, family = gaussian,
prior_intercept = normal(25, 5),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735,
prior_PD = FALSE)
Regression parameters:
head(as.data.frame(produce_model), 3)
Posterior regression structure: scatterplot of farmprice vs average spread
produce %>%
add_fitted_draws(produce_model, n = 100) %>%
ggplot(aes(x=farmprice, y=averagespread, color=productname)) +
geom_line(aes(y = .value, alpha = .1,
group = paste(productname, .draw))) +
geom_point(data = produce, size = 0.1)
The structure seems to mimic the trend in the original scatterplot.
Plotting posterior predictive models:
averagespread_prediction <- posterior_predict(
produce_model,
newdata = data.frame(farmprice = c(10, 10),
productname = c("Potatoes", "Strawberries")))
mcmc_areas(averagespread_prediction) + xlab("farmprice") +
scale_y_discrete(labels = c("Potatoes", "Strawberries"))
Visual examinations: MCMC trace plots, density overlay, autocorrelation function We plot 3 products because there are too many to plot.
par(mfrow = c(2, 2))
mcmc_trace(produce_model, pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"), size = 0.5)
mcmc_dens_overlay(produce_model, pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"))
mcmc_acf(produce_model,pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"))
Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
Please use the `rows` argument instead.
The MCMC looks stable!
Numerical examinations: effective sample size and Rhat
neff_ratio(produce_model)
(Intercept) I(1/farmprice) productnameAvocados
0.10145 0.68360 0.12800
productnameBroccoli Bunches productnameBroccoli Crowns productnameCantaloupe
0.11595 0.11865 0.14330
productnameCarrots productnameCauliflower productnameCelery
0.11755 0.11685 0.11770
productnameFlame Grapes productnameGreen Leaf Lettuce productnameHoneydews
0.14440 0.11955 0.14430
productnameIceberg Lettuce productnameNectarines productnameOranges
0.11190 0.15530 0.12165
productnamePeaches productnamePlums productnamePotatoes
0.15785 0.18305 0.12455
productnameRed Leaf Lettuce productnameRomaine Lettuce productnameStrawberries
0.11440 0.11675 0.11915
productnameThompson Grapes productnameTomatoes sigma
0.17805 0.14475 0.81525
rhat(produce_model)
(Intercept) I(1/farmprice) productnameAvocados
1.0029506 1.0000178 1.0022513
productnameBroccoli Bunches productnameBroccoli Crowns productnameCantaloupe
1.0025763 1.0024525 1.0020801
productnameCarrots productnameCauliflower productnameCelery
1.0024398 1.0022483 1.0024915
productnameFlame Grapes productnameGreen Leaf Lettuce productnameHoneydews
1.0020386 1.0025774 1.0022127
productnameIceberg Lettuce productnameNectarines productnameOranges
1.0026475 1.0019266 1.0022810
productnamePeaches productnamePlums productnamePotatoes
1.0018878 1.0015216 1.0024681
productnameRed Leaf Lettuce productnameRomaine Lettuce productnameStrawberries
1.0024317 1.0023283 1.0022952
productnameThompson Grapes productnameTomatoes sigma
1.0015973 1.0026010 0.9999447
All neff_ratios are >0.1, and all rhats are 1<rhat<1.05, so we know our MCMC has worked well.
Posterior credible intervals:
posterior_interval(produce_model, prob = 0.90)
5% 95%
(Intercept) 52.591079 77.33434
I(1/farmprice) 47.690885 49.65273
productnameAvocados -61.941457 -32.89030
productnameBroccoli Bunches 58.370767 86.28399
productnameBroccoli Crowns 64.134204 91.84403
productnameCantaloupe 36.806662 67.85688
productnameCarrots -45.855134 -17.32926
productnameCauliflower 83.521263 111.41925
productnameCelery 71.702124 99.52841
productnameFlame Grapes -6.789946 23.66940
productnameGreen Leaf Lettuce 161.433185 189.65963
productnameHoneydews 24.871879 56.73218
productnameIceberg Lettuce 59.644561 87.44497
productnameNectarines 68.322882 102.11667
productnameOranges -179.629902 -148.74820
productnamePeaches 86.123902 119.81075
productnamePlums 26.569034 62.46070
productnamePotatoes 144.654829 172.78067
productnameRed Leaf Lettuce 180.928227 209.23566
productnameRomaine Lettuce 168.767226 196.74644
productnameStrawberries 1.247134 29.09246
productnameThompson Grapes 2.030928 36.92291
productnameTomatoes 102.637038 133.92055
sigma 120.593366 122.86748
Most of the intervals are not wide and do not include 0, making our model pretty confident.
Posterior predictive check:
pp_check(produce_model)
This doesn’t look very promising. Our model only mildly mimics the structure.
Let’s predict a set of values. Visualizing the posterior predictive model for the averagespread at farmprice = $3 for 5 random products:
products <- sample(unique(produce$productname), 5)
produce_subset <- subset(produce, productname %in% products & farmprice == 3)
new_data <- data.frame(
productname = products,
farmprice = c(3, 3, 3, 3, 3)
)
predict_product <- posterior_predict(
produce_model,
newdata = new_data)
head(predict_product)
1 2 3 4 5
[1,] -85.87958 316.2543 369.3547 99.758310 120.302357
[2,] 74.26453 294.9136 152.6814 -28.284534 115.777182
[3,] -52.16765 291.1514 331.7296 161.322007 229.373504
[4,] 230.28737 321.5352 267.5527 241.136481 -166.999399
[5,] -17.90543 201.8585 386.2392 266.321521 -121.350961
[6,] -197.70137 229.5699 372.7320 5.220148 -9.568815
mcmc_areas(predict_product, prob = 0.8) +
ggplot2::scale_y_discrete(
labels = products)
The predictive check demonstrates that our model may not be as accurate as we hope.
We can reason that each scatterplot of farmprice vs averagespread depends on the specific productname; as such, for comparison, we’ll build a model that assumes they interact.
interaction_model <- stan_glm(
averagespread ~ I(1/farmprice) + productname + productname:I(1/farmprice),
data = produce, family = gaussian,
prior_intercept = normal(25, 5),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 20000*2, seed = 84735)
Posterior summary statistics:
tidy(interaction_model, effects = c("fixed", "aux"))
Visualizing the posterior model structure:
produce %>%
add_fitted_draws(interaction_model, n = 200) %>%
ggplot(aes(x = farmprice, y = averagespread,
color = productname)) +
geom_line(alpha = 0.1, aes(y = .value,
group = paste(productname, .draw)))
This model also seems to mimic the original structure.
Plotting posterior predictive models:
averagespread_prediction_2 <- posterior_predict(
interaction_model,
newdata = data.frame(farmprice = c(10, 10),
productname = c("Potatoes", "Strawberries")))
mcmc_areas(averagespread_prediction) + xlab("farmprice") +
scale_y_discrete(labels = c("Potatoes", "Strawberries"))
Visual examinations: MCMC trace plots, density overlay, autocorrelation function There are too many to list, so we show 3 productnames.
par(mfrow = c(2, 2))
mcmc_trace(interaction_model, pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"), size = 0.5)
mcmc_dens_overlay(interaction_model, pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"))
mcmc_acf(interaction_model,pars=c("productnameStrawberries", "productnamePotatoes", "productnameOranges"))
The MCMC’s are looking stable!
Numerical examinations: effective sample size and Rhat
neff_ratio(interaction_model)
(Intercept) I(1/farmprice)
0.0654875 0.0641500
productnameAvocados productnameBroccoli Bunches
0.1051000 0.0761125
productnameBroccoli Crowns productnameCantaloupe
0.0775625 0.0769750
productnameCarrots productnameCauliflower
0.0863375 0.0716250
productnameCelery productnameFlame Grapes
0.0768750 0.1148875
productnameGreen Leaf Lettuce productnameHoneydews
0.0780500 0.0769000
productnameIceberg Lettuce productnameNectarines
0.0747500 0.1050625
productnameOranges productnamePeaches
0.0714000 0.0935250
productnamePlums productnamePotatoes
0.1191250 0.0793875
productnameRed Leaf Lettuce productnameRomaine Lettuce
0.0796500 0.0738000
productnameStrawberries productnameThompson Grapes
0.0766875 0.1187750
productnameTomatoes I(1/farmprice):productnameAvocados
0.0937250 0.0765875
I(1/farmprice):productnameBroccoli Bunches I(1/farmprice):productnameBroccoli Crowns
0.0649125 0.0655250
I(1/farmprice):productnameCantaloupe I(1/farmprice):productnameCarrots
0.0645750 0.0648875
I(1/farmprice):productnameCauliflower I(1/farmprice):productnameCelery
0.0646375 0.0651375
I(1/farmprice):productnameFlame Grapes I(1/farmprice):productnameGreen Leaf Lettuce
0.0875625 0.0647250
I(1/farmprice):productnameHoneydews I(1/farmprice):productnameIceberg Lettuce
0.0644625 0.0647375
I(1/farmprice):productnameNectarines I(1/farmprice):productnameOranges
0.0681125 0.0641750
I(1/farmprice):productnamePeaches I(1/farmprice):productnamePlums
0.0662875 0.0728000
I(1/farmprice):productnamePotatoes I(1/farmprice):productnameRed Leaf Lettuce
0.0714625 0.0647625
I(1/farmprice):productnameRomaine Lettuce I(1/farmprice):productnameStrawberries
0.0643375 0.0738750
I(1/farmprice):productnameThompson Grapes I(1/farmprice):productnameTomatoes
0.0855625 0.0670500
sigma
0.6506750
rhat(interaction_model)
(Intercept) I(1/farmprice)
1.0008736 1.0009001
productnameAvocados productnameBroccoli Bunches
1.0004240 1.0007940
productnameBroccoli Crowns productnameCantaloupe
1.0006821 1.0008274
productnameCarrots productnameCauliflower
1.0006600 1.0008309
productnameCelery productnameFlame Grapes
1.0006684 1.0005168
productnameGreen Leaf Lettuce productnameHoneydews
1.0007473 1.0007326
productnameIceberg Lettuce productnameNectarines
1.0008062 1.0004947
productnameOranges productnamePeaches
1.0007899 1.0007152
productnamePlums productnamePotatoes
1.0003909 1.0008420
productnameRed Leaf Lettuce productnameRomaine Lettuce
1.0006249 1.0007640
productnameStrawberries productnameThompson Grapes
1.0007227 1.0005306
productnameTomatoes I(1/farmprice):productnameAvocados
1.0006794 1.0006770
I(1/farmprice):productnameBroccoli Bunches I(1/farmprice):productnameBroccoli Crowns
1.0009105 1.0008701
I(1/farmprice):productnameCantaloupe I(1/farmprice):productnameCarrots
1.0009098 1.0008901
I(1/farmprice):productnameCauliflower I(1/farmprice):productnameCelery
1.0009173 1.0008738
I(1/farmprice):productnameFlame Grapes I(1/farmprice):productnameGreen Leaf Lettuce
1.0006768 1.0008951
I(1/farmprice):productnameHoneydews I(1/farmprice):productnameIceberg Lettuce
1.0008971 1.0009061
I(1/farmprice):productnameNectarines I(1/farmprice):productnameOranges
1.0008294 1.0008960
I(1/farmprice):productnamePeaches I(1/farmprice):productnamePlums
1.0009041 1.0007183
I(1/farmprice):productnamePotatoes I(1/farmprice):productnameRed Leaf Lettuce
1.0008966 1.0008714
I(1/farmprice):productnameRomaine Lettuce I(1/farmprice):productnameStrawberries
1.0008922 1.0007620
I(1/farmprice):productnameThompson Grapes I(1/farmprice):productnameTomatoes
1.0007393 1.0009009
sigma
0.9999916
While all rhats are 1<rhat<1.05, most of the neff_ratios are less than 0.1, which raises suspicions. Nevertheless, we’ll continue.
Posterior credible intervals:
posterior_interval(interaction_model, prob = 0.90)
5% 95%
(Intercept) -48.0849972 5.9595760
I(1/farmprice) 131.5788743 196.1286292
productnameAvocados -64.9287114 6.1223660
productnameBroccoli Bunches -8.2661768 50.6393634
productnameBroccoli Crowns -67.2684961 -7.7113733
productnameCantaloupe 140.7116421 199.5125428
productnameCarrots 170.3538339 233.2132419
productnameCauliflower 91.9004828 148.5577839
productnameCelery -64.1529247 -5.6126519
productnameFlame Grapes -0.3694624 74.6099091
productnameGreen Leaf Lettuce -17.7256860 41.3959064
productnameHoneydews 147.6527766 206.3706225
productnameIceberg Lettuce -51.3726753 6.6361871
productnameNectarines -80.1630429 -10.1574252
productnameOranges 204.3599478 261.0656192
productnamePeaches -69.8626095 -3.1717708
productnamePlums -9.5461900 65.7297610
productnamePotatoes -15.9331257 44.5522883
productnameRed Leaf Lettuce 11.1404345 71.2498646
productnameRomaine Lettuce 23.9551728 81.9085423
productnameStrawberries -53.6868233 5.2961016
productnameThompson Grapes -41.6769656 36.1341603
productnameTomatoes -35.7999253 29.8224292
I(1/farmprice):productnameAvocados -100.0683852 -28.0259388
I(1/farmprice):productnameBroccoli Bunches -90.2873195 -25.1477241
I(1/farmprice):productnameBroccoli Crowns -52.3189675 13.0967386
I(1/farmprice):productnameCantaloupe -155.4599838 -90.6705800
I(1/farmprice):productnameCarrots -181.8281270 -116.7660726
I(1/farmprice):productnameCauliflower -119.4527145 -54.6587844
I(1/farmprice):productnameCelery -65.6026194 -0.4458604
I(1/farmprice):productnameFlame Grapes -100.9623967 -23.2382809
I(1/farmprice):productnameGreen Leaf Lettuce -75.3185316 -10.4345943
I(1/farmprice):productnameHoneydews -160.8630953 -95.9622282
I(1/farmprice):productnameIceberg Lettuce -83.0342194 -18.0872839
I(1/farmprice):productnameNectarines -53.5348779 13.5963644
I(1/farmprice):productnameOranges -189.3568623 -124.7797404
I(1/farmprice):productnamePeaches -61.2353438 4.8749041
I(1/farmprice):productnamePlums -93.9098907 -24.4447085
I(1/farmprice):productnamePotatoes 71.6730073 140.7311885
I(1/farmprice):productnameRed Leaf Lettuce -80.2638983 -15.3207883
I(1/farmprice):productnameRomaine Lettuce -86.5549448 -21.7666723
I(1/farmprice):productnameStrawberries 9.8136059 79.8301458
I(1/farmprice):productnameThompson Grapes -64.4795998 12.7960180
I(1/farmprice):productnameTomatoes -52.7397275 13.7217412
sigma 85.3865777 87.0003819
Most of the credible intervals do not include 0, which is a good sign. However, quite a few of them are very wide, which raises some alarms.
Posterior predictive check:
pp_check(interaction_model)
This model seems to fit better than our first model did, although it’s not as stable as the first model was. Visualizing the posterior predictive model for the averagespread at farmprice = $3 for 5 random products:
products <- sample(unique(produce$productname), 5)
produce_subset <- subset(produce, productname %in% products & farmprice == 3)
new_data <- data.frame(
productname = products,
farmprice = c(3, 3, 3, 3, 3)
)
predict_product <- posterior_predict(
interaction_model,
newdata = new_data)
head(predict_product)
1 2 3 4 5
[1,] 27.85177 14.57094 254.50606 32.15601 203.43253
[2,] 137.30504 -50.27387 88.36031 -42.65681 175.63443
[3,] 47.82571 129.56534 246.68158 218.31116 206.88012
[4,] 192.59296 -20.23407 123.40767 157.71656 125.15852
[5,] 139.56300 -21.97496 329.65007 137.54000 76.73732
[6,] 142.44043 106.42295 171.00168 90.55963 70.26058
mcmc_areas(predict_product, prob = 0.8) +
ggplot2::scale_y_discrete(
labels = products)
Let’s perform a loo comparison.
Predictive accuracy: ELPD Calculate the cross-validated expected
log-predictive densities (ELPD) and compare with the
interaction_model:
elpd_linear <- loo(produce_model)
elpd_interaction <- loo(interaction_model)
loo_compare(elpd_linear, elpd_interaction)
elpd_diff se_diff
interaction_model 0.0 0.0
produce_model -5378.2 128.3
The interaction model performs significantly better against the data than the linear model does. This can be explained by the fact that the trend of the scatterplot changes depending on the productname. Each product has a unique relationship between farmprice and averagespread, so it makes sense that the interaction model would perform better.
Dataset downloaded from Kaggle at https://www.kaggle.com/datasets/everydaycodings/produce-prices-dataset by KUMAR SAKSHAM; originally sourced from http://www.producepriceindex.com.